Vector datasets catalog and downloader #7446

Merged
connortsui20 merged 1 commit into develop from ct/vector-datasets
Apr 15, 2026

@connortsui20
Contributor

Summary

Tracking issue: #7297

We will want to add vector benchmarking soon (see #7399 for a draft).

This adds a simple catalog for the vector datasets hosted at https://assets.zilliz.com/benchmark for VectorDBBench. The catalog describes the shape of each dataset: whether it is partitioned, whether it is randomly shuffled, whether it has neighbor lists for top-k, and so on.

It also handles downloading everything.
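
As a rough illustration of what such a catalog entry could record, here is a minimal Rust sketch. The type and field names (`VectorDatasetEntry`, `TrainLayout`, and so on) are hypothetical and are not taken from the PR; they only mirror the properties listed in the summary.

```rust
/// How a dataset's training vectors are laid out in the bucket.
/// (Hypothetical type, for illustration only.)
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TrainLayout {
    /// One file holds every training vector.
    Single,
    /// Training vectors are split across `num_files` files.
    Partitioned { num_files: u32 },
}

/// One catalog entry. Every field is an assumption about what a catalog
/// might track, not the PR's actual definition.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct VectorDatasetEntry {
    pub name: &'static str,
    pub dimensions: u32,
    pub num_vectors: u64,
    pub layout: TrainLayout,
    /// Whether a randomly shuffled copy of the training data exists.
    pub has_shuffled_variant: bool,
    /// Whether ground-truth neighbor lists for top-k queries exist.
    pub has_neighbors: bool,
}

impl VectorDatasetEntry {
    /// Number of training files a downloader would need to fetch.
    pub fn train_file_count(&self) -> u32 {
        match self.layout {
            TrainLayout::Single => 1,
            TrainLayout::Partitioned { num_files } => num_files,
        }
    }

    /// Size of the raw vectors if stored as uncompressed `f32` values.
    pub fn uncompressed_f32_bytes(&self) -> u64 {
        self.num_vectors * u64::from(self.dimensions) * 4
    }
}
```

Keeping the layout as an enum rather than a bare file count makes the single-file and partitioned cases explicit at every call site.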

I verified that all of this was correct by listing the S3 bucket directly:

aws s3 ls s3://assets.zilliz.com/benchmark/ --region us-west-2 --no-sign-request
for d in bioasq_large_10m bioasq_medium_1m cohere_large_10m cohere_medium_1m \
         cohere_small_100k gist_medium_1m gist_small_100k glove_medium_1m \
         glove_small_100k laion_large_100m  \
         openai_large_5m openai_medium_500k openai_small_50k \
         sift_large_50m sift_medium_5m sift_small_500k; do
  echo "=== $d ==="
  aws s3 ls s3://assets.zilliz.com/benchmark/$d/ --region us-west-2 --no-sign-request
done

This script from the main VectorDBBench repo helped too: https://github.com/zilliztech/VectorDBBench/blob/main/vectordb_bench/backend/dataset.py
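
The directory names from the listing loop above can also be turned into HTTPS URLs under the base quoted in the summary. The sketch below assumes that objects are reachable at `<base>/<dataset>/<file>`, which the PR description does not state explicitly; `file_url` and the constant names are hypothetical, and no real file names are assumed.

```rust
/// Base URL quoted in the PR description.
pub const BASE_URL: &str = "https://assets.zilliz.com/benchmark";

/// Dataset directory names, copied from the listing loop above.
pub const DATASETS: [&str; 16] = [
    "bioasq_large_10m",
    "bioasq_medium_1m",
    "cohere_large_10m",
    "cohere_medium_1m",
    "cohere_small_100k",
    "gist_medium_1m",
    "gist_small_100k",
    "glove_medium_1m",
    "glove_small_100k",
    "laion_large_100m",
    "openai_large_5m",
    "openai_medium_500k",
    "openai_small_50k",
    "sift_large_50m",
    "sift_medium_5m",
    "sift_small_500k",
];

/// Builds the URL of `file_name` inside a known dataset directory.
/// Returns `None` for names that are not in `DATASETS`.
/// The `<base>/<dataset>/<file>` layout is an assumption.
pub fn file_url(dataset: &str, file_name: &str) -> Option<String> {
    if !DATASETS.iter().any(|known| *known == dataset) {
        return None;
    }
    Some(format!("{}/{}/{}", BASE_URL, dataset, file_name))
}
```

Rejecting unknown dataset names up front keeps typos from turning into failed requests later on.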


Things that are not implemented that I would like to add:

  • Is the dataset pre-normalized for cosine similarity? This is not so obvious to me without actually working with the datasets, so I will do this later.
  • Some datasets have scalar labels for all vectors, which help mimic a similarity search combined with a filter on another column. Some of them also have neighbor lists for these filtered queries, so we'll probably want to add that in the future.
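
For the first item, one possible check (a sketch, not something this PR implements) is to test whether every vector already has unit L2 norm within a tolerance. The function names and the `tolerance` parameter are assumptions for illustration.

```rust
/// Euclidean (L2) norm of a vector, accumulated in `f64` to limit
/// rounding error on high-dimensional inputs.
pub fn l2_norm(v: &[f32]) -> f64 {
    v.iter()
        .map(|&x| f64::from(x) * f64::from(x))
        .sum::<f64>()
        .sqrt()
}

/// Returns `true` when every vector has unit norm within `tolerance`.
/// An empty input returns `true` (there is nothing to contradict it).
pub fn is_pre_normalized(vectors: &[Vec<f32>], tolerance: f64) -> bool {
    vectors
        .iter()
        .all(|v| (l2_norm(v) - 1.0).abs() <= tolerance)
}
```

In practice a sample of vectors is probably enough for this check, and the tolerance needs to allow for `f32` rounding in the stored data.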

Testing

N/A

@codspeed-hq

codspeed-hq bot commented Apr 15, 2026

Merging this PR will not alter performance

✅ 1153 untouched benchmarks
⏩ 1455 skipped benchmarks [1]


Comparing ct/vector-datasets (5d70dd0) with develop (9406303)

Footnotes

  1. 1455 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, archive them to remove them from the performance reports.

@connortsui20
Contributor Author

We probably want to mirror these datasets somewhere. @AdamGS @robert3005 is there an easy way to do this?

@AdamGS
Contributor

AdamGS commented Apr 15, 2026

R2 is probably the easiest? Whatever we use for the ClickBench data.

}

/// Stream a large file to disk with a byte-progress bar.
async fn download_with_progress(client: &Client, url: &str, output: &PathBuf) -> Result<()> {
Contributor

Why do we need another one of these? Can't this be part of the general download utils we have here?

Contributor

(this function and a bunch of the following ones)

Contributor Author

Maybe? But afaict we don't use a reqwest client anywhere else.

Contributor

we definitely do
[Image: Screenshot 2026-04-15 at 18 30 53]
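
For context on the thread above, the core of a download helper with byte-level progress is a copy loop that reports the running total. The sketch below is synchronous and uses only the standard library, whereas the function under review is `async` and takes a `Client`; `copy_with_progress` and its callback are hypothetical names, not the repository's shared download utilities.

```rust
use std::io::{self, Read, Write};

/// Copies all bytes from `reader` to `writer`, calling `on_progress`
/// with the running byte total after each chunk is written.
/// Returns the total number of bytes copied.
pub fn copy_with_progress<R: Read, W: Write>(
    mut reader: R,
    mut writer: W,
    mut on_progress: impl FnMut(u64),
) -> io::Result<u64> {
    // Fixed-size buffer so memory use does not grow with file size.
    let mut buf = [0u8; 64 * 1024];
    let mut total = 0u64;
    loop {
        let n = match reader.read(&mut buf) {
            Ok(0) => break, // end of stream
            Ok(n) => n,
            // Retry reads that were interrupted by a signal.
            Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        };
        writer.write_all(&buf[..n])?;
        total += n as u64;
        on_progress(total);
    }
    writer.flush()?;
    Ok(total)
}
```

Because the loop is generic over `Read` and `Write`, a single helper of this shape could serve several callers, which is the kind of sharing the review comment asks about.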

Comment thread vortex-bench/src/vector_dataset/download.rs Outdated
Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
Contributor

AdamGS left a comment

Overall, nothing objectionable here, let's ship it.

@connortsui20
Contributor Author

OK, I'm yoloing this since it doesn't affect anyone else. At some point it would be good to unify the downloading, but if we are going to do that then we might as well implement the catalog idea that @joseph-isaacs had.

connortsui20 merged commit bff43dc into develop Apr 15, 2026
62 checks passed
connortsui20 deleted the ct/vector-datasets branch April 15, 2026 18:23
@AdamGS
Contributor

AdamGS commented Apr 15, 2026

@joseph-isaacs what's the catalog idea? worth writing down somewhere?

connortsui20 added a commit that referenced this pull request Apr 16, 2026
## Summary

Tracking issue: #7297

Adds a TurboQuant demo where we convert the parquet files to a Vortex
file (in-memory only now, but still serialized as bytes), and then we
verify by decoding and performing a basic cosine similarity expression
search with a filter pushdown.

This is based on top of #7446;
please don't merge until that has merged.

## Testing

The example runs!

Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
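
To make the search described in this commit message concrete, here is a brute-force sketch in plain Rust of a cosine similarity threshold search restricted by a scalar filter. It does not use Vortex or TurboQuant APIs, and it does no pushdown; `Row`, `filtered_search`, and the `label` column are hypothetical.

```rust
/// Cosine similarity of two vectors, or `None` if the lengths differ
/// or either vector has zero norm.
pub fn cosine_similarity(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() {
        return None;
    }
    let (mut dot, mut norm_a, mut norm_b) = (0f64, 0f64, 0f64);
    for (&x, &y) in a.iter().zip(b) {
        let (x, y) = (f64::from(x), f64::from(y));
        dot += x * y;
        norm_a += x * x;
        norm_b += y * y;
    }
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some((dot / (norm_a.sqrt() * norm_b.sqrt())) as f32)
}

/// A row with a scalar label column and a vector column (hypothetical).
pub struct Row {
    pub label: u32,
    pub vector: Vec<f32>,
}

/// Returns the indices of rows whose label equals `label` and whose
/// cosine similarity to `query` is at least `threshold`.
/// The cheap scalar filter runs before the similarity computation.
pub fn filtered_search(rows: &[Row], query: &[f32], label: u32, threshold: f32) -> Vec<usize> {
    rows.iter()
        .enumerate()
        .filter(|(_, row)| row.label == label)
        .filter_map(|(i, row)| {
            cosine_similarity(&row.vector, query)
                .filter(|score| *score >= threshold)
                .map(|_| i)
        })
        .collect()
}
```

Applying the label filter first means the similarity is only computed for rows that can still match, which is the same ordering idea that a filter pushdown exploits.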

Labels

changelog/feature A new feature
